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ABSTRACT 

Motivation: Large-scale methods for inferring gene trees are error- 
prone. Correcting gene trees for weakly supported features often re- 
sults in non-binary trees, i.e. trees with polytomies, thus raising the 
natural question of refining such polytomies into binary trees. A feature 
pointing toward potential errors in gene trees are duplications that are 
not supported by the presence of multiple gene copies. 
Results: We introduce the problem of refining polytomies in a gene 
tree while minimizing the number of created non-apparent duplications 
in the resulting tree. We show that this problem can be described as a 
graph-theoretical optimization problem. We provide a bounded heur- 
istic with guaranteed optimal ity for well-characterized instances. We 
apply our algorithm to a set of ray-finned fish gene trees from the 
EnsembI database to illustrate its ability to correct dubious 
duplications. 

Availability and implementation: The C++ source code for the al- 
gorithms and simulations described in the article are available at 
http://www-ens.iro. umontreal .ca/~lafonman/software. php . 
Contact: lafonman@iro. umontreal. ca or mabrouk@iro. umontreal. ca 
Supplementary information: Supplementary data are available at 
Bioinformatics online. 



1 INTRODUCTION 

With the increasing number of completely sequenced genomes, 
the task of identifying gene counterparts in different organisms 
becomes more and more important. This is usually done by clus- 
tering genes sharing significant sequence similarity, constructing 
gene trees and then inferring macro-evolutionary events such as 
duplications, losses or transfers through reconciliation with the 
phylogenetic tree of the considered taxa. The inference of accur- 
ate gene trees is an important step in this pipeline. While gene 
trees are traditionally constructed solely from sequence align- 
ments (Guidon and Gascuel, 2003; Ronquist and Huelsenbeck, 
2003; Saitou and Nei, 1987), recent methods incorporate infor- 
mation from species phytogenies, gene order and other genomic 
footprint (Akerborg et aL, 2009;Boussau et al., 2013; Durand 
et aL, 2003; Rasmussen and Kellis, 2011; Szollosi et aL, 2013; 
Thomas, 2010; Wapinski et aL, 2007). A large number of gene 
tree databases are now available (Datta et aL, 2009;Flicek, 2012; 
Huerta-Cepas et aL, 2011; Mi et aL, 2012; Schreiber et aL, 2013). 

But constructing accurate gene trees is still challenging; for 
example, a significant number of nodes in the EnsembI gene 
trees are labelled as 'dubious' (Flicek, 2012). In a recent study, 
we have been able to show that ^^30% of 6241 EnsembI gene 
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trees for the genomes of the fishes Stickleback, Medaka, 
Tetraodon and Zebrafish exhibit at least one gene order incon- 
sistency and thus are likely to be erroneous (Lafond et aL, 2013). 
Moreover, owing to various reasons such as insufficient differ- 
entiation between gene sequences and ahgnment ambiguities, it is 
often difficult to support a single gene tree topology with high 
confidence. Several support measures, such as bootstrap values 
or Bayesian posterior probabihties, have been proposed to detect 
weakly supported edges. Recently, intense efforts have been put 
towards developing tools for gene tree correction (Berglund- 
Sonnhammer et aL, 2006; Chaudhary et aL, 2011; Chen et aL, 
2000; Doroftei and El-Mabrouk, 2011; Gorecki and Eulenstein, 
2011a,b; Nguyen et aL, 2013; Swenson et aL, 2012; Wu et aL, 
2012). A natural approach is to remove a weakly supported edge 
and collapse its two incident vertices into one (Beiko and 
Hamilton, 2006), or to remove 'dubious' nodes and join resulting 
subtrees under a single root (Lafond et aL, 2013). The resulting 
tree is non-binary with polytomies (multifurcating nodes) repre- 
senting unresolved parts of the tree. A natural question is then to 
select a binary refinement of each polytomy based on appropri- 
ate criteria. This has been the purpose of a few theoretical and 
algorithmic studies conducted in the past years, most of them 
based on minimizing the mutation (i.e. duplication and loss) cost 
of reconciliation (Chang and Eulenstein, 2006; Lafond et aL, 
2012; Vernot et aL, 2009; Zheng et aL, 2012). 

In the present article, we consider a different reconciliation 
criterion for refining a polytomy, which consists in minimizing 
the number of non-apparent duplication (NAD) nodes. A duph- 
cation node x of a gene tree (according to the reconcihation with 
a given species tree) is a NAD if the genome sets of its two 
subtrees are disjoint. In other words, the reason x is a duphcation 
is not the presence of paralogs in the same genome, but rather an 
inconsistency with the species tree. Such nodes have been flagged 
as potential errors in different studies (Chauve and El-Mabrouk, 
2009; Flicek, 2012; Scornavacca et aL, 2009). In particular, they 
correspond to the nodes flagged as 'dubious' in EnsembI gene 
trees. 

We introduce the polytomy refinement problem in Section 2, 
and we show in Section 3 how it reduces to a clique decompos- 
ition problem in a graph representing speciation and duplication 
relationships between the leaves of a polytomy. We develop a 
bounded heuristic in Section 4, with guaranteed optimahty in 
well-characterized instances. In Section 5 we exhibit a general 
methodology, using our polytomy refinement algorithm, for cor- 
recting NAD nodes of a gene tree. We then show in Section 6 
that this approach is in agreement with the observed corrections 
of EnsembI gene trees from one release to another. 
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2 THE POLYTOMY REFINEMENT PROBLEM 

Phylogenies and reconciliations. A phylogeny is a rooted tree that 
represents the evolutionary relationships of a set of elements 
(such as species, genes, . . . ) represented by its nodes: internal 
nodes are ancestors, leaves are extant elements and edges repre- 
sent direct descents between parents and children. We consider 
two kinds of phylogenies: species trees and gene trees. A species 
tree S describes the evolution of a set of related species, from a 
common ancestor (the root of the tree), through the mechanism 
of speciation. For our purpose, species are identified with gen- 
omes, and genomes are simply sets of genes. As for a gene tree, it 
describes the evolution of a set of genes, through the evolution- 
ary mechanisms of speciation and duplication. Therefore, each 
gene g, extant or ancestral, belongs to a species denoted by s(g). 
The set of genes in a gene tree is called a gene family. A leaf-label 
corresponds to a genome in a species tree, and to a gene belong- 
ing to a genome in a gene tree. 

Given a phylogeny T, we denote by /(7) the leaf- set and by 
V{T) the node- set of T. Given a node x of T, we denote by l{x) 
and call the clade of x, the leaf-set of the subtree of 7" rooted at x. 
We call an ancestor of x any node y on the path from the root of 
T to the parent of x. In this case we write y<x. Two nodes x, y 
are unrelated if none is an ancestor of the other. For a leaf subset 
X of T, IcariX), the lowest common ancestor (LCA) of X in T, 
denotes the farthest node from the root of T, which is an an- 
cestor of all the elements of X. In this article, species trees are 
assumed to be binary: each internal node has two children, 
representing its direct descendants (see S in Fig. 1). For an in- 
ternal node X of a binary tree, we denote by xg and x^ the two 
children of x. 

Definition 1. (Reconciliation) A reconciliation between a binary 
gene tree G and a species tree S consists in mapping each internal 
or leaf node x of G ( representing respect, an ancestral or extant 
gene) to the species s(x) corresponding to the LCA in of the set 
{s{l), for all / G l{x)]. Every internal node x of G is labelled by an 
event E(x) verifying: E{x)= Speciation (S) if s(x) is different 
from s{xi) and s{xr), and E{x) = Duplication otherwise. 

We define two types of duplication nodes of a gene tree G. A 
N on- Apparent- Duplication (NAD) is a duplication node x of G 
such that {\Jxei{xi)^) n {\Jy^ii^xr)y) = 0- A duplication that is not an 
NAD is an apparent duplication (AD) node, i.e. a node with the 
left and right subtrees sharing a common leaf-label. Therefore, 
any internal node x of G is of type S, AD or NAD. 

The gene trees we consider might be non-binary. We call polyt- 
omy a gene tree with a non-binary root (see T in Fig. 1). 

Definition 2. (Binary refinement) A tree Ht is a refinement of a 
tree T if and only if the two trees have the same leaf -set and T can 
be obtained from by contracting some edges. When refines 
T, each node of T can be mapped to a unique node of Ht so that the 
ancestral relationship is preserved. Ht is a binary refinement of T 
if and only if is binary and is a refinement of T. 

In this article, as only binary refinements are considered, we 
omit the term binary from now. 

Problem statement. The general problem we address is the fol- 
lowing: Given a non-binary gene tree G and a species tree S, find 
a refinement of G containing the minimum number of NADs with 



respect to S. Such a refinement of G is called a minimum refinement 
of G w.r.t. S. 

Hence, we aim at refining each non-binary node of G. We first 
show that each such non-binary node of G can be refined inde- 
pendently of the other non-binary nodes. 

Theorem 1. Let {G/, for \ <i <n] be the set of subtrees of G 
rooted at the n children {x/, for \ <i <n} of the root of G. Let 
Hrnin{Gi, S) be a minimum refinement of Gi w.r.t. S. Let G' be the 
tree obtained from G by replacing each Gi by Hmin{Gi, S). Then a 
minimum refinement of G is a minimum refinement of G' . 

It follows from Theorem 1 that a minimum refinement of G 
can be obtained by a depth-first procedure iteratively solving 
each polytomy G^, for each internal node x of G. 

In the rest of this paper, we consider G as a polytomy, and we 
denote by T the forest {Gi, G2, . . . G„} obtained from G by 
removing the root. For simplicity, we make no difference be- 
tween a tree G/ of T and its root. In particular, ^(G/) corresponds 
to s(root(Gi)), where root(Gi) is the root of G/ (Fig. 1). We are 
now ready to define the main optimization problem we consider. 
Minimum NAD polytomy refinement (MinNADref) problem: 
Input: A polytomy G and a species tree S; 
Output: In the set H(G) of all refinements of G, a refinement H 
with the minimum number of NAD nodes. Such a refinement is 
called a solution to the MinNADref problem. 

3 A GRAPH-THEORETICAL CHARACTERIZATION 

We show (Theorem 2) that the MinNADref Problem reduces to a 
clique decomposition problem on a graph that represents the 
impact, in terms of NAD creation, of joining pairs of trees from T. 

The join graph of a polytomy. We first define a graph R based 
on the notion of join. A join is an unordered pair {Gi, G2} where 
Gi, G2 G JF. The join operation j on {Gi, G2} consists in joining 
the roots of Gi and G2 under a common parent; we denote by 
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Fig. 1. A forest a species tree S and the corresponding graph R. Each 
gene tree G of is attached to its corresponding node s(G) in S. In R, 
joins of type AD are represented by green lines. All other lines are the 
joins of type S. Non-trivial AD-components (AD-components containing 
at least two nodes) are represented by green ovals. Red lines in R repre- 
sent a vertex-disjoint clique W of Rs- Here, Rad U W has a single con- 
nected component, which leads to the binary refinement H of T with no 
NAD. After the joins of W are applied (red edges in H), the speciation- 
free forest can be joined with four joins AD (green vertices in H) 
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G\ 2 the resulting join tree. We call the join type of j= {Gi, G2}, 
and denote by jt(Gi,G2), the reconciliation label of the node 
created by joining Gi and G2 (i.e. the root of G12), where 
jt(Gi, G2) e {S, AD, NAD], respectively, for speciation, AD 
and NAD, w.r.t. the species tree S. 

We denote by R = (V, E) the join graph of JF, defined as the 
unoriented complete graph on the set of vertices V=T, where 
each edge (join) is labelled by the corresponding join type 
(Fig. 1). We denote by Rs and R^d the subgraphs of R defined 
by the edges of type, respectively, S and AD. We call a connected 
component of Rad an AD-component. 

Let T' be the new forest obtained by replacing the two trees Gx 
and G2 of T by the join tree G\ 2- The rules given below, follow- 
ing directly from the definition of speciation and duplication in 
reconciliation, are used to update the join type jt{G\^2, T) for any 
TeT\{GuG2]. 

Ruleset 1 

(1) Ifjt{Gi, T) = AD or jt(G2, T) = AD, then jt(Gi,2^ T) = AD; 

(2) Otherwise, if jt(Gi, T) = NAD or jt(G2, T) = NAD, then 
jt(Gi2, T) = NAD; 

(3) Otherwise, if lca(T) is not a descendant of lca{G\^2), then 
jt(G,,2,T) = S; 

(4) Otherwise, jt(Gi2,T) = NAD. 

Clique decomposition of the join graph. Let a join sequence 
J=(J\, J2, . . . , J\j\) be an ordered list of joins. We denote by 
T{J, i) the forest obtained after applying the first / joins of /, 
starting with T. Note that T{J, 0) = T, and that // g / is a join 
on T{J, i — 1). Let J denote the set of all possible join se- 
quences of size |:F| — 1. Clearly, applying all joins of a sequence 
J e J yields a single binary tree, and there exists a gene tree H 
e H{G) with d NADs if and only if there exists a join sequence 
J e J with d joins of type NAD. We refine this property by 
showing that there is a solution to the MinNADref problem 
where all dupHcation nodes are ancestral to all speciation nodes 
(see the tree //of Fig. 1 for an example). The proof (not shown) 
makes abundant use of Ruleset 1 . 

Lemma 1. There exists a binary refinement HeH(G) with d 
NADs if and only if there exists a join sequence J e J with d 
joins of type NAD such that, if Ji e J is the first join not of type 
S in J, then all following joins Jj,for j> i, are of type AD or NAD. 

We define a speciation tree as a gene tree in which every internal 
node is a speciation node. We deduce from the previous 
lemma that we can obtain a solution H to the MinNADref prob- 
lem by creating a forest of speciation trees first, then successively 
joining them with joins of type AD or NAD. As the nodes of R 
corresponding to the leaves of a given speciation subtree of H are 
pairwise joined by speciation edges, they form a clique in Rs (in 
Fig. 1 the cliques in red are selected and the corresponding joins 
are applied to compute refinement H). The next theorem makes 
the link between the number of NADs of i/ and the cHques of Rs- 
For a set W of vertex-disjoint cliques of Rs, we denote by 
Rad U W the graph defined by the union of the edges of Rad 
and W. 

Theorem 2. A solution to the MinNADref Problem has d NADs 
if and only if, among all graphs Rad U W where W is a set of 



vertex-disjoint cliques of Rs, at least one has d + 1 connected com- 
ponents and none has less than d + 1 connected components. 

The proof of Theorem 2 is constructive. Given an optimal set 
W of vertex-disjoint cHques of Rs, it leads to an optimal refine- 
ment H. Unfortunately, it can be shown that, given an arbitrary 
graph with two edge colours AD and S, finding if there exists a 
set W yielding a given number of connected components is an 
NP-hard problem (proof not shown). However, R is constrained 
by the structure of a species tree, which restricts the space of 
possible join graphs. An arbitrary complete graph R with edges 
labelled on the alphabet {S, AD, NAD} is said to be valid if there 
exists a species tree and a polytomy whose join graph is R. We 
characterize below the valid graphs in terms of forbidden 
induced subgraphs. The proof is partially based on well-known 
results on P4-free graphs (Cornell et al. 1985). 

Theorem 3. A graph R is valid if and only if Rs is {Pa, 2K2}-free, 
meaning that no four vertices of Rs induce a path of length 4, nor 
two vertex-disjoint edges. 

Although we have not been able to find an exact polynomial- 
time algorithm for the MinNADref problem, this very con- 
strained structure of the R graph yields a bounded heuristic for 
this problem with good theoretical properties described in the 
next section. 

Remark 1. The P4-free property, which was already introduced in 
relation with reconciliations in ( Hellmuth et al., 2013) , is of special 
interest, as many NP-hard problems on graphs have been shown to 
admit polynomial time solutions when restricted to this class of 
graphs. Unfortunately we can prove that, given an arbitrary P4- 
free graph on which we add AD edges , finding an optimal W is still 
NP-hard (proof not shown). However, the added 2K2-free restric- 
tion imposes a rigid structure on the graph at hand, and we con- 
jecture that there exists a polynomial time algorithm to find an 
optimal W. 



4 A BOUNDED HEURISTIC 

We first describe a general approach based on the notion of 
useful speciations, followed by a refinement of this approach 
with guaranteed optimality criteria. 

Definition 3. Let J={J\, . . . , J\j\) be a join sequence. A join Ji = 
{Gi ,G2] of J is a useful speciation if jt{G\ , G2) = S and Gj, G2 are 
in two different AD-components of the R graph obtained after 
applying the J \, ... , Ji-\ joins. 

Hence, if R has c AD-components, finding a zero NAD solu- 
tion becomes the problem of finding a join sequence with c — \ 
useful speciations. For example, the graph R in Figure 1 has five 
AD-components (three trivial and two non-trivial), and thus the 
four useful speciations represented by the red lines lead to a 0 
NAD solution (the binary tree H). In the general case, the prob- 
lem we face is to select as many useful speciations as possible, as 
the resulting AD-components will have to be connected by NAD 
joins. If we define a speciation-free forest as a forest T such that 
no edge of its join graph 7? is a speciation edge, following Lemma 
1, we would like to first compute a set of useful speciations that 
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results in a speciation-free forest whose join- graph has the least 
number of AD-components. 

Definition 4. A lowest useful speciation is a useful speciation 
edge {Gi, Gi} of Rs such that s{G\^2) is not the ancestor of any 
s(Gij),for {Gi, Gj] being another useful speciation edge of Rs- 

Lowest useful speciations fit naturally in the context of 
bottom-up algorithms where speciations edges that correspond 
to lower vertices of S are selected before speciations edges cor- 
responding to ancestral species. The theorem below shows that 
proceeding along these lines ensures that the resulting join se- 
quence contains at least half of the optimal number of useful 
speciation. 

Theorem 4. Let s be the maximum number of useful speciations 
leading to a solution to the MinNADref problem. Then any algo- 
rithm that creates a speciation-free forest through lowest useful 
speciations makes at least \s/2'\ useful speciations. 

This theorem implicitly defines a heuristic with approximation 
ratio 2 on the number of useful speciations that visits S in 
a bottom-up way, making useful speciations (which would 
thus be lowest useful speciations) whenever such an edge is 
available. 

We now describe an improved version of this general heuristic 
principle. A detailed example is given in Figure 2. The main idea 
is to consider a bottom-up traversal of the species tree S, and 
for each visited vertex s, to find a useful set of speciation edges 
by finding a matching in a bipartite graph. More precisely, 
for a node s e V(S), we consider the complete bipartite graph 
B = (XUY, {xy\x e X,y e Y}) such that the left (respectively 
right) subset X (respectively 7) contains all the trees G/ of 
T where s(Gi) is on the left (respectively right) subtree of s. 
Consider the two partitions ADx and ADy of X and Y, respect- 
ively, into AD-components. The key step of our heuristic is to 
find a matching M c E(B) of useful speciations between ADx 
and ADy, called a useful matching. For example, in Figure 2, 
the bipartite graph and matching illustrated for Step 3 corres- 
pond to node / and that of Step 4 to node m of S. 

Notice that not all edges of B correspond to useful 
speciations. Indeed it is possible that for some x e X and some 
y e Y, although (x, y} is a speciation edge, x and y are in the 
same AD-component of R due to another tree z not in B such 
that {x, z} and {z, y} are AD-edges. For example in Figure 1, 
although {(a), (g)} is a join of type S, the trees (a) and (g) are in 
the same AD-component of R due to the tree ((a,f), (g, h)). For a 
vertex x of X (respectively yofY), denote by AD{x) (respectively 
AD{y)) the component of ADx (respectively AD^ containing x 
(respectively y). We indicate the fact that AD{x) and AD(y) 
belong to the same AD-component in R by adding two 
dummy genes bi in AD(x) and b2 in AD(y), and a bridge {bi, b2} 
in E(B). Such bridges will be included in every matching, prevent- 
ing to include non-useful speciation edges. 

An instance P of the problem associated with a vertex ^ of 5 is 
denoted by P = (X, Y, ADx, ADy, B) where X, Y, ADx, ^Dy are 
defined as above and B is the set of bridges induced by R. The 
graph corresponding to P, i.e. the complete bipartite graph on 
sides X and Y to which we added the bridge edges B, is denoted 
by B{P). The whole method is summarized in Algorithm 1 
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Fig. 2. A species tree S and a forest of binary trees forming the 
polytomy. The trees of T are placed on S according to their LCA. The 
/, k, I and m nodes of S are annotated with the forest obtained after 
running Algorithm 2 on these nodes. Their corresponding complete bi- 
partite matching instances are illustrated at the bottom. AD joins are 
represented by dotted lines, useful matching are represented by plain 
lines (we omit drawing all the other edges of the complete bipartite 
graphs). Note that there is a bridge induced by M between (F, K) and / 
at step 4. In the fourth step, we obtain a single connected component, 
which allows, in a final step, to connect all the subtrees by AD nodes 
(final tree is on the top of the figure) 



MinNADref(:F, S) and illustrated on a simple example in 
Figure 2. 



Algorithm 1: MinNADref{T , S). 



for each node 5 of 5 in a bottom-up traversal of S do 

Let P = {X, Y, ADx, ADy, B) be the problem instance corresponding 
to s; 

Find a useful matching M of B{P) of maximum size (Algorithm 
MaxMatching below); 
Apply each speciation of M, and update T 
end for 

For each connected component C of Rad, join the trees of C under AD 
Nodes; 

If there is more than one tree remaining, join them under NAD nodes. 

Finding a useful matching of maximum size can be done in 
polynomial time by Algorithm 2. For an instance 
P = {X, Y, ADx, ADy, B), the algorithm progressively increments 
the set M of speciation edges, eventually leading to a useful 
matching of maximum size. At a given step, let Qp^m be the 
graph with vertices XUY and edges Ep^m = Ead^ M, where 
E^j) is the set of AD edges of R connecting vertices of XL) Y. 
Components ADx^ e ADx and ADy. e ADy are linked if there is 
a path in Gp^m linking a vertex of AD x^ to a vertex of ADy , and 
not linked otherwise. 

Algorithm 2:MaxMatching(X, Y, Xx, ADy, B). 

D = 0;M = B; 
while 7) /XU Tdo 

Find CeADxDADy of maximum cardinality with vertices not 
included in D, if any; assume w.l.o.g. C = ADxi G ADx', 
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for each x e C that is not incident to an edge in M do 

if there is an g Y such that AD(y) is not Hnked to C then 
Find such y with AD(y) of maximum cardinality; 
Add the vertices x and y to D and add the speciation edge {x, y} to M; 
end if 
end for 

Add remaining vertices of C to D; 
end whUe 

Theorem 5. Given an instance P = (X, Y, ADx, ADy, B), 
Algorithm 1 finds a useful matching M of maximum size. 

Algorithm 1 is a heuristic, as it may fail to give the optimal 
solution (refinement with minimum number of NADs), as in 
Figure 1 for example. In this example, a bottom-up approach 
would greedily speciate a and d, which cannot lead to the optimal 
solution. However, we prove in Theorem 6 that if transitivity 
holds for the duphcation join type, then Algorithm 1 is an 
exact algorithm for the MinNADref problem. The example of 
Figure 1 does not satisfy this property, as {(a), ((a,f), (g, h))] is a 
join of duplication type (AD), {((a,f), (g, h)), (g)] is a join of 
duplication type but {(a), (g)} is a join of speciation type. 

Theorem 6. (1) Let s be the maximum number of useful speci- 
ations leading to a solution to the MinNADref problem. Then, 
Algorithm 1 makes at least \s/2'\ useful speciations. (2) If , for 
every node s of S the instance P corresponding to s has no bridges, 
then Algorithm 1 outputs a refinement of the input polytomy with 
the maximum number of useful speciations. 

The following corollary provides an alternative formulation of 
the optimality result given by the above theorem. 

Corollary. Algorithm 1 exactly solves the MinNADref problem 
for an input (JF, S) such that each AD-component of the corres- 
ponding graph R is free from S edges (i.e. there is no S edge 
between any two vertices of a given AD-component) . 



5 GENE TREE CORRECTION 

The polytomy refinement problem is motivated by the problem 
of correcting gene trees. Duphcation nodes can be untrusted for 
many reasons, one of them being the fact that they are NADs, 
pointing to disagreements with the species tree that are not due 
to the presence of duplicated genes. Different observations tend 
to support the hypothesis that NAD nodes may point at errone- 
ous parts of a gene tree (Chauve and El-Mabrouk, 2009; 
Swenson et al, 2012). For example, the Ensembl Compara gene 
trees (Vilella et al., 2009) have all their NAD nodes labelled as 
'dubious'. In (Chauve and El-Mabrouk, 2009), using simulated 
datasets based on the species tree of 12 Drosophila species given 
in (Hahn et al., 2007) and a birth-and-death process, starting 
from a single ancestral gene, and with different gene gain/loss 
rates, it has been found that 95% of gene duplications lead to an 
AD vertex. Although suspected to be erroneous, some NAD 
nodes may still be correct, due to a high number of losses. 
However, in the context of reconcihation, the additional 
damage caused by an erroneous NAD node is the fact that it 
significantly increases the real rearrangement cost of the tree 
(Swenson et al., 2012). Therefore, tools for modifying gene 



trees according to NADs are required. We show now how 
Algorithm 1 can be used in this context. 

In (Lafond et al., 2013), a method for correcting untrusted 
duphcation nodes has been developed. The correction of a du- 
phcation node X relies on pushing x by multifurcation, which 
transforms x into a speciation node with two children being 
the roots of two polytomies. Figure 3 recalls the pushing by multi- 
furcation procedure. These polytomies are then refmed by using 
an algorithm developed in (Lafond et al, 2012), which optimizes 
the mutation cost of reconcihation. 

In the context of correcting NADs, we use the same general 
methodology, but now using Algorithm MinNADref for refining 
polytomies. Removing all NADs of a gene tree can then be done 
by iteratively applying the above methodology on the highest 
NAD node of the tree (the closest to the root). 

6 RESULTS 

Simulated data. Simulations are performed as follows. For a 
given integer n, we generate a species tree S with a random 
number of leaves between 0.5« and 3«. We then generate a 
forest T = {G\, . . . ,Gn) of cherries by randomly picking, for 
each cherry G/ G T, one node st e S and two leaves, one from 
each of the two subtrees rooted at Sf. Any leaf of S is used at 
most once (possibly by adding leafs to S if required), leading to a 
set of cherries related through joins of type S or NAD. Then, for 
each pair {G/, Gj} with join type NAD, we relate them through 
AD with probability 1/2 (or do nothing with probabihty 1/2), by 
adding a duplicated leaf. 

For each pair (S, T), we compared the number of NADs 
found by Algorithm MinNADref with the minimum number 
of NADs returned by an exact algorithm exploring all possible 
binary trees that can be constructed from T. We generated a 
thousand random S and T for each « > 4. We stopped at 
n= 14, as the brute-force algorithm is too time-costly beyond 
this point. Over all the explored datasets simulated as described 
above. Algorithm MinNADref was able to output an optimal 
solution, i.e. a refinement with the minimum number of NADs. 
Therefore, the examples on which the heuristic fails seem to be 
rare, and the algorithm performs well on polytomies of reason- 
able size. 

We then wanted to assess how the NAD minimization criter- 
ion differs from the rearrangement cost minimization criterion. 
We generated 960 random instances with forests of sizes ranging 
between 5 and 100 (10 instances for each 5 <n < 100). We 




Fig. 3. A gene tree G and a species tree S, from which we obtain G* by 
pushing X by multifurcation. Here, x is a NAD, and is pushed by taking 
the forest of maximal subtrees of G that only have genes from species in 
the si subtree (green), then another forest for the s,. subtree (red) in the 
same manner. Both these forests are joined under a polytomy, which are 
then joined under a common parent, so the root of G* is a speciation 
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compared the output of Algorithm MinNADref with that of 
Algorithm MinDLref, given in (Lafond et al., 20\2), which com- 
putes refinement minimizing the duplication + loss (DL) cost of 
reconciliation with the species tree. Both algorithms gave the 
exact same refinement for only 12 instances (1.25%). As ex- 
pected, Algorithm MinNADref always yielded a refined tree 
with a lower or equal number of NADs than the tree given by 
Algorithm MinDLref, but always had a higher or equal DL-cosi. 
However, in many cases, minimizing the DL-scoyq did not min- 
imize the number of NADs, as in 377 instances (39.3%), 
Algorithm MinNADref yielded strictly less NADs than 
Algorithm MinDLref. 
Ensembl Gene Trees. 

Next we tested the relevance of the proposed gene tree correc- 
tion methodology, by exploring how Ensembl gene trees are cor- 
rected from one release to another. As the Ensembl general 
protocol for reconstructing gene trees does not change between 
releases, the observed modifications on gene trees are more likely 
due to modifications on gene sequences. 

We used the Ensembl Genome Browser to collect all available 
gene trees containing genes from the monophyletic group of ray- 
finned fishes (Actinopterygii), and filtered each tree to preserve 
only genes from the taxa of interest (ray-fmned fish genomes). 
We selected from both Releases 74 (the present one) and 70 the 
1096 gene trees that are present in both with exactly the same set 
of genes from the monophyletic group of fishes, and with less 
NAD nodes in Release 74. We wanted to see to what extent our 
general principle of correcting an NAD by transforming it to a 
speciation node is observed by comparing Rel.70 to Rel.74. Such 
a transformation requires to preserve the clade of the corrected 
NAD node x of the initial tree, meaning that l(x) should also be 
the leaf- set of a subtree in the corrected tree. For >90% of these 
trees (993 trees), the highest NAD node clade was preserved in 
Rel.74. Moreover, among all such nodes that were corrected, i.e. 
were not NAD nodes in Rel.74 (641 trees), almost all were trans- 
formed into speciation nodes (630 trees), which strongly supports 
our correction paradigm. 
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RF Dist / Max RF Dist 

Fig. 4. Normalized RF-distance between corrected gene trees (by modi- 
fication of the highest NAD) from Rel. 70 and corresponding gene trees 
in Rel. 74. Blue curve: transformation of the highest NAD into a speci- 
ation. Red curve: trees with a non-trivial polytomy after pushing by 
multifurcation. Yellow curve: all trees 



To evaluate our methodology for correcting NADs, we 
apphed it to the highest NAD node of each of the 1096 afore- 
mentioned trees of Rel.70. Figure 4 illustrates a comparison be- 
tween the corrected trees (Rel.70C, C standing for 'Corrected') 
obtained by our methodology and those of Rel. 74. Pairwise 
comparisons are based on the normalized Robinson-Foulds 
(RF) distance (number of identical clade s divided by the total 
number of clades). The yellow curve shows a good correlation 
between Rel.70C and Rel.74, with --65% exhibiting >80% similar 
clades between Rel.70C and Rel.74. If we reduce the set of trees to 
those for which the highest NAD node is also transformed to a 
speciation node in Rel.74 (630 trees), the correlation is even better 
(blue curve of Fig. 4), with 44% of trees being identical (277 over 
630 trees) and ^^80% exhibiting >80% similar clades between 
Rel.70C and Rel.74. Now, to specifically evaluate Algorithm 
MinNADref, we further restricted the set of trees to those 
giving rise to a non-trivial polytomy (i.e. polytomy of degree > 2) 
after the pushing by multifurcation, which leads to a set of 117 
trees. Overall, the results for these trees (red curve in Fig. 4) are 
close to those observed for all trees (yellow curve) detailed above. 

We then wanted to evaluate our correction of the 117 afore- 
mentioned trees compared with trees in Rel.74. Figure 5 provides 
an evaluation of the corrected trees (yellow curve) compared with 
those in Rel. 74 (blue curve) based on the normalized RF dis- 
tance with the initial trees in Rel.70. Overall, the initial tree is 
closer to our correction than to the one of Rel.74. Therefore, 
even though gene trees of Rel.74 are likely to have stronger stat- 
istical support with respect to the gene sequences provided in 
Rel.74, our correction removes NADs while respecting as 
much as possible the given tree topology. Finally, we considered 
the reconcihation mutation cost as another evaluation criterion. 
Among the 117 trees of Rel.70C, 30 are identical to the corres- 
ponding trees in Rel. 74, and 60% have a lower mutation cost, 
which tend to support our correction compared with the tree in 
Rel.74. As for the 40% remaining trees, half of them have more 
NADs than the corresponding tree in Rel.74, which suggests that 
applying our correction to all NAD, instead of just the highest 
one, would help to obtain better results. 



117 trees having non-trivial polytomy 
0.45 n 
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Fig. 5. Normalized RF-distance between corrected trees (yellow curve) 
and Rel. 74 trees (blue curve) and original Rel. 70 trees 
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Finally, we evaluated the effect of NAD correction on the tree 
likelihood. For this purpose, we selected the 1891 Ensembl Rel.74 
gene trees of the considered monophyletic group containing at 
least one NAD, and we corrected each NAD individually. The 
sequences were aligned using ClustalW (Larkin et al, 2007) and 
the likelihood values were computed with PhyML (Guidon et al., 
2003). For a tree T and a NAD node x, denote by the tree 
obtained after correcting x. For each T and each x, we computed 
the log- likelihood ratio L(x) = logLH(T)/ logLH(Tx). Among the 
4454 NAD nodes found in the considered set of trees, 95.4% of 
the L(x) ratios were between 0.98 and 1.02. Although the cor- 
rection algorithm is not expected to outperform the Ensembl 
protocol in terms of likelihood as it ignores sequences, we 
found that the likelihood of the tree has been improved 
(L(x)>\) after correction for 43.9% of the NAD nodes. 
Moreover, 1180 (62.4%) trees contained at least one NAD 
node improving the likelihood. 

7 CONCLUSION 

The present work is dedicated to the polytomy refinement prob- 
lem. While the mutation cost of reconcihation has been used pre- 
viously as an optimization criterion for choosing an appropriate 
binary tree, here we use an alternative criterion, which is the mini- 
mization of NADs. The tractability of the MinNADref Problem 
remains open, as is the problem to select, among all possible so- 
lutions, those leading to a minimum reconciliation cost. Although 
developing a gene tree correction tool is not the purpose of this 
article, we show how our algorithm for polytomy refinement can 
be used in this context, by developing a simple algorithm allowing 
to correct a single NAD. This algorithm has been applied to trees 
of a previous Ensembl release, and the corrected trees have been 
compared with the trees of the current Ensembl release. A good 
correlation between the two sets of trees is observed, which tends 
to support our correction paradigm. While minimizing NADs 
cannot be a sufficient criterion for gene tree correction, it should 
rather be seen as one among others, such as statistical (Wu et al., 
2012), syntenic (Lafond et al., 2013) or based on reconciliation 
with the species tree (Chaudhary et al., 201 1; Lafond et al., 2013; 
Swenson et al., 2010), that can be integrated in a methodological 
framework for gene tree correction. 
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